#1: UNDERSTANDING THE PROBLEM STATEMENT & USE CASE¶

Problem Statement: Predicting the used car prices by discovering the features which affect the price.¶

Use Case: This Project can be used by car dealers or companies like CARS24 & SPINNY to predict used car prices and understand the key factors involved in the same.¶

#2: IMPORTING & INSTALLING REQUIRED LIBRARIES/DATASETS¶

In [16]:
!pip install xgboost
!pip install plotly
Requirement already satisfied: xgboost in ./opt/anaconda3/lib/python3.9/site-packages (1.6.1)
Requirement already satisfied: scipy in ./opt/anaconda3/lib/python3.9/site-packages (from xgboost) (1.7.3)
Requirement already satisfied: numpy in ./opt/anaconda3/lib/python3.9/site-packages (from xgboost) (1.21.5)
Requirement already satisfied: plotly in ./opt/anaconda3/lib/python3.9/site-packages (5.6.0)
Requirement already satisfied: six in ./opt/anaconda3/lib/python3.9/site-packages (from plotly) (1.16.0)
Requirement already satisfied: tenacity>=6.2.0 in ./opt/anaconda3/lib/python3.9/site-packages (from plotly) (8.0.1)
In [17]:
import numpy as np                                                          # Using mutli-dimensional array objects                                                                
import pandas as pd                                                         # Data Manipulation
import seaborn as sns                                                       # Data Visualization
import matplotlib.pyplot as plt                                             # Data Visualization 
import plotly.express as px                                                 # Interacative Data Visualization 
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot  # Offline version of the Plotly modules
import warnings                                                             # Ignoring the warnings 
warnings.filterwarnings("ignore") 
from sklearn.model_selection import train_test_split                        # Machine Learning
In [18]:
df = pd.read_csv("cars_data.csv")
In [19]:
df.head(5)
Out[19]:
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All $36,945 $33,337 3.5 6.0 265 17 23 4451 106 189
1 Acura RSX Type S 2dr Sedan Asia Front $23,820 $21,761 2.0 4.0 200 24 31 2778 101 172
2 Acura TSX 4dr Sedan Asia Front $26,990 $24,647 2.4 4.0 200 22 29 3230 105 183
3 Acura TL 4dr Sedan Asia Front $33,195 $30,299 3.2 6.0 270 20 28 3575 108 186
4 Acura 3.5 RL 4dr Sedan Asia Front $43,755 $39,014 3.5 6.0 225 18 24 3880 115 197
In [20]:
df.tail(5)
Out[20]:
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
423 Volvo C70 LPT convertible 2dr Sedan Europe Front $40,565 $38,203 2.4 5.0 197 21 28 3450 105 186
424 Volvo C70 HPT convertible 2dr Sedan Europe Front $42,565 $40,083 2.3 5.0 242 20 26 3450 105 186
425 Volvo S80 T6 4dr Sedan Europe Front $45,210 $42,573 2.9 6.0 268 19 26 3653 110 190
426 Volvo V40 Wagon Europe Front $26,135 $24,641 1.9 4.0 170 22 29 2822 101 180
427 Volvo XC70 Wagon Europe All $35,145 $33,112 2.5 5.0 208 20 27 3823 109 186

#3: EDA¶

In [21]:
df.columns
Out[21]:
Index(['Make', 'Model', 'Type', 'Origin', 'DriveTrain', 'MSRP', 'Invoice',
       'EngineSize', 'Cylinders', 'Horsepower', 'MPG_City', 'MPG_Highway',
       'Weight', 'Wheelbase', 'Length'],
      dtype='object')
In [22]:
df.shape
Out[22]:
(428, 15)
In [23]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 428 entries, 0 to 427
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Make         428 non-null    object 
 1   Model        428 non-null    object 
 2   Type         428 non-null    object 
 3   Origin       428 non-null    object 
 4   DriveTrain   428 non-null    object 
 5   MSRP         428 non-null    object 
 6   Invoice      428 non-null    object 
 7   EngineSize   428 non-null    float64
 8   Cylinders    426 non-null    float64
 9   Horsepower   428 non-null    int64  
 10  MPG_City     428 non-null    int64  
 11  MPG_Highway  428 non-null    int64  
 12  Weight       428 non-null    int64  
 13  Wheelbase    428 non-null    int64  
 14  Length       428 non-null    int64  
dtypes: float64(2), int64(6), object(7)
memory usage: 50.3+ KB
In [24]:
# Converting MSRP and Invoice to integer datatype by removing $ & (,)
df["MSRP"] = df["MSRP"].str.replace("$", "")
df["MSRP"] = df["MSRP"].str.replace(",", "")
df["MSRP"] = df["MSRP"].astype(int)
df["Invoice"] = df["Invoice"].str.replace("$", "")
df["Invoice"] = df["Invoice"].str.replace(",", "")
df["Invoice"] = df["Invoice"].astype(int)
In [25]:
df.head()
Out[25]:
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945 33337 3.5 6.0 265 17 23 4451 106 189
1 Acura RSX Type S 2dr Sedan Asia Front 23820 21761 2.0 4.0 200 24 31 2778 101 172
2 Acura TSX 4dr Sedan Asia Front 26990 24647 2.4 4.0 200 22 29 3230 105 183
3 Acura TL 4dr Sedan Asia Front 33195 30299 3.2 6.0 270 20 28 3575 108 186
4 Acura 3.5 RL 4dr Sedan Asia Front 43755 39014 3.5 6.0 225 18 24 3880 115 197
In [26]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 428 entries, 0 to 427
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Make         428 non-null    object 
 1   Model        428 non-null    object 
 2   Type         428 non-null    object 
 3   Origin       428 non-null    object 
 4   DriveTrain   428 non-null    object 
 5   MSRP         428 non-null    int64  
 6   Invoice      428 non-null    int64  
 7   EngineSize   428 non-null    float64
 8   Cylinders    426 non-null    float64
 9   Horsepower   428 non-null    int64  
 10  MPG_City     428 non-null    int64  
 11  MPG_Highway  428 non-null    int64  
 12  Weight       428 non-null    int64  
 13  Wheelbase    428 non-null    int64  
 14  Length       428 non-null    int64  
dtypes: float64(2), int64(8), object(5)
memory usage: 50.3+ KB
In [27]:
round(df.describe(),2)
Out[27]:
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
count 428.00 428.00 428.00 426.00 428.00 428.00 428.00 428.00 428.00 428.00
mean 32774.86 30014.70 3.20 5.81 215.89 20.06 26.84 3577.95 108.15 186.36
std 19431.72 17642.12 1.11 1.56 71.84 5.24 5.74 758.98 8.31 14.36
min 10280.00 9875.00 1.30 3.00 73.00 10.00 12.00 1850.00 89.00 143.00
25% 20334.25 18866.00 2.38 4.00 165.00 17.00 24.00 3104.00 103.00 178.00
50% 27635.00 25294.50 3.00 6.00 210.00 19.00 26.00 3474.50 107.00 187.00
75% 39205.00 35710.25 3.90 6.00 255.00 21.25 29.00 3977.75 112.00 194.00
max 192465.00 173560.00 8.30 12.00 500.00 60.00 66.00 7190.00 144.00 238.00
In [28]:
#checking duplicates
df.duplicated().sum()
Out[28]:
0
In [29]:
#checking no. of null values in each column
df.isnull().sum()
Out[29]:
Make           0
Model          0
Type           0
Origin         0
DriveTrain     0
MSRP           0
Invoice        0
EngineSize     0
Cylinders      2
Horsepower     0
MPG_City       0
MPG_Highway    0
Weight         0
Wheelbase      0
Length         0
dtype: int64
In [30]:
#checking % of null values in each column
df.isnull().sum() * 100 / len(df)
Out[30]:
Make           0.00000
Model          0.00000
Type           0.00000
Origin         0.00000
DriveTrain     0.00000
MSRP           0.00000
Invoice        0.00000
EngineSize     0.00000
Cylinders      0.46729
Horsepower     0.00000
MPG_City       0.00000
MPG_Highway    0.00000
Weight         0.00000
Wheelbase      0.00000
Length         0.00000
dtype: float64
In [31]:
#handling null values by dropping them since they are extremely less
df = df.dropna()
In [32]:
df.isnull().sum()
Out[32]:
Make           0
Model          0
Type           0
Origin         0
DriveTrain     0
MSRP           0
Invoice        0
EngineSize     0
Cylinders      0
Horsepower     0
MPG_City       0
MPG_Highway    0
Weight         0
Wheelbase      0
Length         0
dtype: int64

#4.1: Univariate Analysis¶

In [33]:
df.Make.unique()
Out[33]:
array(['Acura', 'Audi', 'BMW', 'Buick', 'Cadillac', 'Chevrolet',
       'Chrysler', 'Dodge', 'Ford', 'GMC', 'Honda', 'Hummer', 'Hyundai',
       'Infiniti', 'Isuzu', 'Jaguar', 'Jeep', 'Kia', 'Land Rover',
       'Lexus', 'Lincoln', 'MINI', 'Mazda', 'Mercedes-Benz', 'Mercury',
       'Mitsubishi', 'Nissan', 'Oldsmobile', 'Pontiac', 'Porsche', 'Saab',
       'Saturn', 'Scion', 'Subaru', 'Suzuki', 'Toyota', 'Volkswagen',
       'Volvo'], dtype=object)
In [34]:
fig = px.histogram(df, x = "Make",labels = {"Make":"Manufacturer"},title = "MAKE OF THE CAR",color_discrete_sequence = ["blue"])
fig.show()
In [35]:
df.Type.unique()
Out[35]:
array(['SUV', 'Sedan', 'Sports', 'Wagon', 'Truck', 'Hybrid'], dtype=object)
In [36]:
fig = px.histogram(df, x = "Type",labels = {"Type":"Type"},title = "TYPE OF THE CAR",color_discrete_sequence = ["green"])           
fig.show()
In [37]:
df.Origin.unique()
Out[37]:
array(['Asia', 'Europe', 'USA'], dtype=object)
In [38]:
fig = px.histogram(df, x = "Origin",labels = {"Origin":"Origin"},title = "LOCATION OF THE CAR SALES",color_discrete_sequence = ["orange"])     
fig.show()
In [39]:
df.DriveTrain.unique()
Out[39]:
array(['All', 'Front', 'Rear'], dtype=object)
In [40]:
fig = px.histogram(df, x = "DriveTrain",labels = {"DriveTrain":"Drivetrain"},title = "DRIVE TRAIN OF THE CAR",color_discrete_sequence = ["red"])
fig.show()
In [41]:
fig = px.violin(df, x = "EngineSize",labels = {"EngineSize":"EngineSize"},title = "ENGINE SIZE OF THE CARS",color_discrete_sequence = ["purple"])
fig.show()
In [42]:
fig = px.violin(df, x = "Cylinders",labels = {"Cylinders":"Cylinders"},title = "CYLINDERS OF THE CARS",color_discrete_sequence = ["blue"])
fig.show()
In [43]:
fig = px.violin(df, x = "Horsepower",labels = {"Horsepower":"Horsepower"},title = "HORSEPOWER OF THE CARS",color_discrete_sequence = ["green"])
fig.show()
In [44]:
fig = px.violin(df, x = "MPG_City",labels = {"MPG_City":"MPG_City"},title = "MPG in City OF THE CARS",color_discrete_sequence = ["yellow"])
fig.show()
In [45]:
fig = px.violin(df, x = "MPG_Highway",labels = {"MPG_Highway":"MPG_Highway"},title = "MPG IN HIGHWAY OF THE CARS",color_discrete_sequence = ["orange"])
fig.show()
In [46]:
fig = px.violin(df, x = "Weight",labels = {"Weight":"Weight"},title = "WEIGHT OF THE CAR",color_discrete_sequence = ["red"])
fig.show()
In [47]:
fig = px.violin(df, x = "Length",labels = {"Length":"Length"},title ="LENGTH OF THE CARS",color_discrete_sequence = ["purple"])
fig.show()
In [48]:
fig = px.box(df, x = "MSRP",labels = {"MSRP":"MSRP"},title ="MSRP OF THE CARS",color_discrete_sequence = ["blue"])
fig.show()
In [49]:
fig = px.box(df, x = "Invoice",labels = {"Invoice":"Invoice"},title ="INVOICE OF THE CARS",color_discrete_sequence = ["green"])
fig.show()

Univariate Analysis:

  1. There are 38 car manufactures in which Toyota has maximum cars while Hummer has least.
  2. There are 6 Types of Cars in which Sedan type is maximum(more than the cumulative sum of other types) while Hybrid type is the least.
  3. The 3 Origins of all cars: ASIA, EUROPE & USA.
  4. As expected Mileage Per Gallon of cars in Highways is more than that of Cities.
  5. The majority of MSRP lies in range of 10.28k-65k. There are certain outliers too.

4.2: Bivariate Analysis¶

In [50]:
fig = px.histogram(df, x = "Make",color = "Origin",labels = {"Make":"Manufacturer"},title = "MAKE OF THE CAR Vs LOCATION",)                
fig.show()
In [51]:
fig = px.histogram(df, x = "Make",color = "Type",labels = {"Make":"Manufacturer"},title = "MAKE OF THE CAR Vs TYPE")                  
fig.show()
In [52]:
# Plot the make of the car and its location
fig = px.histogram(df, x = "Type",color = "Origin",labels = {"Make":"Manufacturer"},title = "TYPE OF THE CAR Vs LOCATION")
fig.show()
In [53]:
# Plot the make of the car and its location
fig = px.histogram(df, x = "Type",color = "DriveTrain",labels = {"Make":"Manufacturer"},title = "TYPE OF THE CAR Vs DriveTrain")
fig.show()
In [54]:
# Plot the make of the car and its location
fig = px.histogram(df, x = "Origin",color = "DriveTrain",labels = {"Make":"Manufacturer"},title = "Location OF THE CAR Vs DriveTrain")
fig.show()
In [55]:
df_MSRP_Invoice=df[["MSRP","Invoice"]]
sns.boxplot(data=df_MSRP_Invoice)
Out[55]:
<AxesSubplot:>

Bivariate Analysis:

  1. Hybrid cars have been manufactured only in ASIA by Honda & Toyota.
  2. Porsche manufactures most Sports cars(88% of all cars they manufacture).
  3. There are no manufactures of Trucks in Europe.
  4. The values of MSRP and Invoice are highly coorelated. Thus Invoice can be dropped before applying ML model.

# 4.3: Multivariate Analysis¶

In [56]:
sns.pairplot(data = df) 
Out[56]:
<seaborn.axisgrid.PairGrid at 0x7f8f386fb730>
In [57]:
round(df.corr(),2)
Out[57]:
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
MSRP 1.00 1.00 0.57 0.65 0.83 -0.48 -0.44 0.45 0.15 0.17
Invoice 1.00 1.00 0.57 0.65 0.82 -0.47 -0.44 0.44 0.15 0.17
EngineSize 0.57 0.57 1.00 0.91 0.79 -0.72 -0.73 0.81 0.64 0.64
Cylinders 0.65 0.65 0.91 1.00 0.81 -0.68 -0.68 0.74 0.55 0.55
Horsepower 0.83 0.82 0.79 0.81 1.00 -0.68 -0.65 0.63 0.39 0.38
MPG_City -0.48 -0.47 -0.72 -0.68 -0.68 1.00 0.94 -0.74 -0.51 -0.50
MPG_Highway -0.44 -0.44 -0.73 -0.68 -0.65 0.94 1.00 -0.79 -0.53 -0.47
Weight 0.45 0.44 0.81 0.74 0.63 -0.74 -0.79 1.00 0.76 0.69
Wheelbase 0.15 0.15 0.64 0.55 0.39 -0.51 -0.53 0.76 1.00 0.89
Length 0.17 0.17 0.64 0.55 0.38 -0.50 -0.47 0.69 0.89 1.00
In [58]:
plt.figure(figsize = (8,8))
sns.heatmap(df.corr(), cmap="YlGnBu", annot = True)
Out[58]:
<AxesSubplot:>
In [59]:
fig = px.scatter_3d(df, x='Make', y='Type', z='DriveTrain',color='Origin')
fig.show()

Multivariate Analysis: MSRP has highest positive coorelation with HORSEPOWER.

# 5: PREPARING THE DATA FOR MODEL TRAINING¶

In [45]:
df.head()
Out[45]:
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945 33337 3.5 6.0 265 17 23 4451 106 189
1 Acura RSX Type S 2dr Sedan Asia Front 23820 21761 2.0 4.0 200 24 31 2778 101 172
2 Acura TSX 4dr Sedan Asia Front 26990 24647 2.4 4.0 200 22 29 3230 105 183
3 Acura TL 4dr Sedan Asia Front 33195 30299 3.2 6.0 270 20 28 3575 108 186
4 Acura 3.5 RL 4dr Sedan Asia Front 43755 39014 3.5 6.0 225 18 24 3880 115 197
In [46]:
# Performing One-Hot Encoding for "Make", "Model", "Type", "Origin", and "DriveTrain"
df_dum = pd.get_dummies(df, columns=["Make", "Model", "Type", "Origin", "DriveTrain"])
In [47]:
df_dum
Out[47]:
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length ... Type_Sedan Type_Sports Type_Truck Type_Wagon Origin_Asia Origin_Europe Origin_USA DriveTrain_All DriveTrain_Front DriveTrain_Rear
0 36945 33337 3.5 6.0 265 17 23 4451 106 189 ... 0 0 0 0 1 0 0 1 0 0
1 23820 21761 2.0 4.0 200 24 31 2778 101 172 ... 1 0 0 0 1 0 0 0 1 0
2 26990 24647 2.4 4.0 200 22 29 3230 105 183 ... 1 0 0 0 1 0 0 0 1 0
3 33195 30299 3.2 6.0 270 20 28 3575 108 186 ... 1 0 0 0 1 0 0 0 1 0
4 43755 39014 3.5 6.0 225 18 24 3880 115 197 ... 1 0 0 0 1 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
423 40565 38203 2.4 5.0 197 21 28 3450 105 186 ... 1 0 0 0 0 1 0 0 1 0
424 42565 40083 2.3 5.0 242 20 26 3450 105 186 ... 1 0 0 0 0 1 0 0 1 0
425 45210 42573 2.9 6.0 268 19 26 3653 110 190 ... 1 0 0 0 0 1 0 0 1 0
426 26135 24641 1.9 4.0 170 22 29 2822 101 180 ... 0 0 0 1 0 1 0 0 1 0
427 35145 33112 2.5 5.0 208 20 27 3823 109 186 ... 0 0 0 1 0 1 0 1 0 0

426 rows × 483 columns

In [48]:
df_data = df_dum.drop(["Invoice"], axis = 1)
# Invoice feature do not contribute to car price prediction 
In [49]:
df_data
Out[49]:
MSRP EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length Make_Acura ... Type_Sedan Type_Sports Type_Truck Type_Wagon Origin_Asia Origin_Europe Origin_USA DriveTrain_All DriveTrain_Front DriveTrain_Rear
0 36945 3.5 6.0 265 17 23 4451 106 189 1 ... 0 0 0 0 1 0 0 1 0 0
1 23820 2.0 4.0 200 24 31 2778 101 172 1 ... 1 0 0 0 1 0 0 0 1 0
2 26990 2.4 4.0 200 22 29 3230 105 183 1 ... 1 0 0 0 1 0 0 0 1 0
3 33195 3.2 6.0 270 20 28 3575 108 186 1 ... 1 0 0 0 1 0 0 0 1 0
4 43755 3.5 6.0 225 18 24 3880 115 197 1 ... 1 0 0 0 1 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
423 40565 2.4 5.0 197 21 28 3450 105 186 0 ... 1 0 0 0 0 1 0 0 1 0
424 42565 2.3 5.0 242 20 26 3450 105 186 0 ... 1 0 0 0 0 1 0 0 1 0
425 45210 2.9 6.0 268 19 26 3653 110 190 0 ... 1 0 0 0 0 1 0 0 1 0
426 26135 1.9 4.0 170 22 29 2822 101 180 0 ... 0 0 0 1 0 1 0 0 1 0
427 35145 2.5 5.0 208 20 27 3823 109 186 0 ... 0 0 0 1 0 1 0 1 0 0

426 rows × 482 columns

In [50]:
# Feeding input features to X and output (MSRP) to y
X = df_data.drop("MSRP", axis = 1)
y = df_data["MSRP"]
In [51]:
X = np.array(X)
In [52]:
y = np.array(y)
In [53]:
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size = 0.2)
In [281]:
df_data.shape
Out[281]:
(426, 482)
In [282]:
X_train.shape
Out[282]:
(340, 481)
In [283]:
X_test.shape
Out[283]:
(86, 481)
In [337]:
y_train.shape
Out[337]:
(340,)
In [338]:
y_test.shape
Out[338]:
(86,)

# 5: PREPARING THE DATA FOR MODEL TRAINING¶

In [272]:
df.head()
Out[272]:
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945 33337 3.5 6.0 265 17 23 4451 106 189
1 Acura RSX Type S 2dr Sedan Asia Front 23820 21761 2.0 4.0 200 24 31 2778 101 172
2 Acura TSX 4dr Sedan Asia Front 26990 24647 2.4 4.0 200 22 29 3230 105 183
3 Acura TL 4dr Sedan Asia Front 33195 30299 3.2 6.0 270 20 28 3575 108 186
4 Acura 3.5 RL 4dr Sedan Asia Front 43755 39014 3.5 6.0 225 18 24 3880 115 197
In [273]:
# Performing One-Hot Encoding for "Make", "Model", "Type", "Origin", and "DriveTrain"
df_dum = pd.get_dummies(df, columns=["Make", "Model", "Type", "Origin", "DriveTrain"])
In [274]:
df_dum
Out[274]:
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length ... Type_Sedan Type_Sports Type_Truck Type_Wagon Origin_Asia Origin_Europe Origin_USA DriveTrain_All DriveTrain_Front DriveTrain_Rear
0 36945 33337 3.5 6.0 265 17 23 4451 106 189 ... 0 0 0 0 1 0 0 1 0 0
1 23820 21761 2.0 4.0 200 24 31 2778 101 172 ... 1 0 0 0 1 0 0 0 1 0
2 26990 24647 2.4 4.0 200 22 29 3230 105 183 ... 1 0 0 0 1 0 0 0 1 0
3 33195 30299 3.2 6.0 270 20 28 3575 108 186 ... 1 0 0 0 1 0 0 0 1 0
4 43755 39014 3.5 6.0 225 18 24 3880 115 197 ... 1 0 0 0 1 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
423 40565 38203 2.4 5.0 197 21 28 3450 105 186 ... 1 0 0 0 0 1 0 0 1 0
424 42565 40083 2.3 5.0 242 20 26 3450 105 186 ... 1 0 0 0 0 1 0 0 1 0
425 45210 42573 2.9 6.0 268 19 26 3653 110 190 ... 1 0 0 0 0 1 0 0 1 0
426 26135 24641 1.9 4.0 170 22 29 2822 101 180 ... 0 0 0 1 0 1 0 0 1 0
427 35145 33112 2.5 5.0 208 20 27 3823 109 186 ... 0 0 0 1 0 1 0 1 0 0

426 rows × 483 columns

In [275]:
df_data = df_dum.drop(["Invoice"], axis = 1)
# Invoice feature do not contribute to car price prediction 
In [276]:
df_data
Out[276]:
MSRP EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length Make_Acura ... Type_Sedan Type_Sports Type_Truck Type_Wagon Origin_Asia Origin_Europe Origin_USA DriveTrain_All DriveTrain_Front DriveTrain_Rear
0 36945 3.5 6.0 265 17 23 4451 106 189 1 ... 0 0 0 0 1 0 0 1 0 0
1 23820 2.0 4.0 200 24 31 2778 101 172 1 ... 1 0 0 0 1 0 0 0 1 0
2 26990 2.4 4.0 200 22 29 3230 105 183 1 ... 1 0 0 0 1 0 0 0 1 0
3 33195 3.2 6.0 270 20 28 3575 108 186 1 ... 1 0 0 0 1 0 0 0 1 0
4 43755 3.5 6.0 225 18 24 3880 115 197 1 ... 1 0 0 0 1 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
423 40565 2.4 5.0 197 21 28 3450 105 186 0 ... 1 0 0 0 0 1 0 0 1 0
424 42565 2.3 5.0 242 20 26 3450 105 186 0 ... 1 0 0 0 0 1 0 0 1 0
425 45210 2.9 6.0 268 19 26 3653 110 190 0 ... 1 0 0 0 0 1 0 0 1 0
426 26135 1.9 4.0 170 22 29 2822 101 180 0 ... 0 0 0 1 0 1 0 0 1 0
427 35145 2.5 5.0 208 20 27 3823 109 186 0 ... 0 0 0 1 0 1 0 1 0 0

426 rows × 482 columns

In [277]:
# Feeding input features to X and output (MSRP) to y
X = df_data.drop("MSRP", axis = 1)
y = df_data["MSRP"]
In [278]:
X = np.array(X)
In [279]:
y = np.array(y)
In [280]:
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size = 0.2)
In [281]:
df_data.shape
Out[281]:
(426, 482)
In [282]:
X_train.shape
Out[282]:
(340, 481)
In [283]:
X_test.shape
Out[283]:
(86, 481)
In [337]:
y_train.shape
Out[337]:
(340,)
In [338]:
y_test.shape
Out[338]:
(86,)

# 6.1: TRAINING AND EVALUATING MULTIPLE LINEAR REGRESSION¶

In [284]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, accuracy_score
from math import sqrt
In [285]:
LinearRegression_model= LinearRegression()
LinearRegression_model.fit(X_train, y_train)
Out[285]:
LinearRegression()
In [286]:
accuracy_LinearRegression = LinearRegression_model.score(X_test, y_test)
accuracy_LinearRegression
Out[286]:
0.8432973844392416
In [349]:
print("ACCURACY of Multiple Linear Regression: ",round(accuracy_LinearRegression*100,2),"%")
ACCURACY of Multiple Linear Regression:  84.33 %

# 6.2: TRAINING AND EVALUATING DECISION TREE¶

In [287]:
from sklearn.tree import DecisionTreeRegressor

DecisionTree_model = DecisionTreeRegressor()
DecisionTree_model.fit(X_train, y_train)
Out[287]:
DecisionTreeRegressor()
In [288]:
accuracy_DecisionTree = DecisionTree_model.score(X_test, y_test)
accuracy_DecisionTree
Out[288]:
0.8825769736325962
In [350]:
print("ACCURACY of Decision Tree:",round(accuracy_DecisionTree*100,2),"%")
ACCURACY of Decision Tree: 88.26 %

# 6.3: TRAINING AND EVALUATING RANDOM FOREST REGRESSION¶

In [289]:
from sklearn.ensemble import RandomForestRegressor
In [290]:
RandomForest_model = RandomForestRegressor(n_estimators= 5, max_depth= 5)
RandomForest_model.fit(X_train, y_train)
Out[290]:
RandomForestRegressor(max_depth=5, n_estimators=5)
In [291]:
accuracy_RandomForest= RandomForest_model.score(X_test, y_test)
accuracy_RandomForest
Out[291]:
0.8431022604395081
In [351]:
print("ACCURACY of Random Forest Regression:",round(accuracy_RandomForest*100,2),"%")
ACCURACY of Random Forest Regression: 84.31 %

# 6.4: TRAINING AND EVALUATING XGBOOST ALGORITHM¶

In [292]:
from xgboost import XGBRegressor
In [293]:
model = XGBRegressor()
model.fit(X_train, y_train)
Out[293]:
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, ...)
In [294]:
accuracy_XGBoost = model.score(X_test, y_test)
accuracy_XGBoost
Out[294]:
0.9368475564239244
In [352]:
print("ACCURACY of XGBOOST Algorithm:",round(accuracy_XGBoost*100,2),"%")
ACCURACY of XGBOOST Algorithm: 93.68 %

# 7: COMPARING MODELS AND CALCULATING REGRESSION KPIs¶

In [295]:
y_predict_linear = LinearRegression_model.predict(X_test)

fig = sns.regplot(y_predict_linear, y_test, color = 'red', marker = "^")
fig.set(title = "Linear Regression Model", xlabel = "Predicted Price of the used cars ($)", ylabel = "Actual Price of the used cars ($)")
Out[295]:
[Text(0.5, 1.0, 'Linear Regression Model'),
 Text(0.5, 0, 'Predicted Price of the used cars ($)'),
 Text(0, 0.5, 'Actual Price of the used cars ($)')]
In [296]:
RMSE= float(format(np.sqrt(mean_squared_error(y_test, y_predict_linear)), ".3f"))
MSE= mean_squared_error(y_test, y_predict_linear)
MAE= mean_absolute_error(y_test, y_predict_linear)
r2= r2_score(y_test, y_predict_linear)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2) 
RMSE = 6976.547 
MSE = 48672211.62598064 
MAE = 5060.856301243534 
R2 = 0.8432973844392416
In [297]:
y_predict_RandomForest = RandomForest_model.predict(X_test)

fig = sns.regplot(y_predict_RandomForest, y_test, color = 'blue', marker = "s")
fig.set(title = "Random Forest Regression Model", xlabel = "Predicted Price of the used cars ($)", ylabel= "Actual Price of the used cars ($)")
Out[297]:
[Text(0.5, 1.0, 'Random Forest Regression Model'),
 Text(0.5, 0, 'Predicted Price of the used cars ($)'),
 Text(0, 0.5, 'Actual Price of the used cars ($)')]
In [298]:
RMSE= float(format(np.sqrt(mean_squared_error(y_test, y_predict_RandomForest)), ".3f"))
MSE= mean_squared_error(y_test, y_predict_RandomForest)
MAE= mean_absolute_error(y_test, y_predict_RandomForest)
r2= r2_score(y_test, y_predict_RandomForest)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2) 
RMSE = 6980.889 
MSE = 48732817.6125135 
MAE = 4217.669264831376 
R2 = 0.8431022604395081
In [299]:
y_predict_XGBoost = model.predict(X_test)

fig = sns.regplot(y_predict_XGBoost, y_test, color = 'green', marker = "D")
fig.set(title = "XGBoost Model", xlabel = "Predicted Price of the used cars ($)", ylabel = "Actual Price of the used cars ($)")
Out[299]:
[Text(0.5, 1.0, 'XGBoost Model'),
 Text(0.5, 0, 'Predicted Price of the used cars ($)'),
 Text(0, 0.5, 'Actual Price of the used cars ($)')]
In [300]:
RMSE = float(format(np.sqrt(mean_squared_error(y_test, y_predict_XGBoost)), ".3f"))
MSE = mean_squared_error(y_test, y_predict_XGBoost)
MAE = mean_absolute_error(y_test, y_predict_XGBoost)
r2 = r2_score(y_test, y_predict_XGBoost)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2) 
RMSE = 4428.917 
MSE = 19615301.80867185 
MAE = 3039.6424305050873 
R2 = 0.9368475564239244

# 8: CONCLUSION¶

XGBoost model scores 94% accuracy which clearly outperforms the Multiple Linear Regression(84%), Decision Tree(88%) and Random Forest Regression(84%) models.¶